Ultra-rapid somatic variant detection via real-time targeted amplicon sequencing

Molecular markers are essential for cancer diagnosis, clinical trial enrollment, and some surgical decision making, motivating ultra-rapid, intraoperative variant detection. Sequencing-based detection is considered the gold standard approach, but typically takes hours to perform due to time-consuming DNA extraction, targeted amplification, and library preparation. In this work, we present a proof-of-principle approach for sub-1 hour targeted variant detection using real-time DNA sequencers. By modifying existing protocols to optimize for diagnostic time-to-result, we demonstrate confirmation of a hot-spot mutation from tumor tissue in ~52 minutes. To further reduce time, we explore rapid, targeted Loop-mediated Isothermal Amplification (LAMP) and design a bioinformatics tool, LAMPrey, to process sequenced LAMP product. LAMPrey's concatemer-aware alignment algorithm is designed to maximize recovery of diagnostically relevant information, leading to more rapid detection than standard read alignment approaches. Using LAMPrey, we demonstrate confirmation of a hot-spot mutation (250x support) from tumor tissue in less than 30 minutes.

Molecular diagnostic techniques and associated rapid time-frames. Sequencing-based diagnostics have not been accomplished within the intra-operative time-frame.
*qPCR variant allele fractions are inferred from comparison to wildtype allele amplification or a standard curve and have not been demonstrated intra-operatively.
**Sequencing can identify any mutation that lies within an amplicon; therefore, the number of reportable target mutations is at least 4^n (where 'n' is the length of the amplicon(s) under evaluation).
***This work does not evaluate copy number variation detection, but the approach should be capable of detecting these types of mutations given the addition of an assay targeting a copy-number control gene.
FISH = fluorescence in situ hybridization; IHC = immunohistochemistry; RT-qPCR = reverse transcription quantitative polymerase chain reaction; qPCR = quantitative polymerase chain reaction.

Cycling Parameters
Primer sets used in all evaluations are listed in Table S1 below. ONT handshake sequence tails are highlighted in red and blue. For tailed primers, both the genomic length and the total amplicon length including the handshake sequences (bolded) are listed. For FIP and BIP primers, poly-T bend segments are bolded.

We successfully sequenced and aligned all amplicons using standard bioinformatics pipelines (see Methods section). However, several inefficiencies were noted (Figure S1a). Some sequenced reads could not be mapped to the human genome. This loss was between 8% and 20% (Mapping Loss; Figure S2b) across all amplicon sizes and decreased as amplicon length increased. This indicates that longer amplicons tend to fragment into longer reads, which are generally easier to map.
Based on the final amplified mass relative to the input template genomic DNA (measured via Qubit, Invitrogen dsDNA HS Assay #Q33230), we also would expect amplicons to be sequenced in roughly the same proportion. However, we observed a bias away from this expectation in favor of background genomic DNA (Expectation Letdown; Figure S1b).
This indicates that the transposome favors interacting with and fragmenting longer genomic reads over shorter amplicons 12. While a trend favoring longer amplicons exists, even the shortest amplicon (187bp) incurred a loss of only 6.9% (versus 2.2% for the 910bp product) relative to amplified mass, an acceptable loss.
Fragmentation loss (Fragmentation Loss; Figure S1b), defined here as the reads that map to the target region but do not cover the target locus, is notably worse for longer amplicons. This is due to longer amplicons fragmenting in multiple locations, producing a higher ratio of strands in the library that do not contain the target locus. The transposome will also fragment amplicons near their ends, producing both nearly full-length amplicons and fragments that are too short to map properly. This leads to an uneven distribution of aligned read lengths (Figure S1c). These results indicate that it is beneficial to design primer sets that generate relatively short amplicons with any hot-spot target of interest near the center of the amplicon (to reduce fragmentation loss), but long enough to preserve the proportion of useful, mappable reads.
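The loss categories above can be made concrete with a small sketch. The code below is illustrative only and is not the pipeline used in this work: the `Alignment` record, the `classify` and `loss_fractions` helpers, and the choice of denominators (all reads for mapping loss, reads overlapping the target region for fragmentation loss) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Primary alignment of one read (hypothetical record).

    Coordinates are 0-based and end-exclusive; chrom=None means unmapped.
    """
    chrom: Optional[str]
    start: int = 0
    end: int = 0


def classify(aln, region, locus):
    """Label one read as 'unmapped', 'background', 'fragment' or 'target'.

    region is (chrom, start, end) of the amplicon; locus is the hot-spot position.
    """
    if aln.chrom is None:
        return "unmapped"
    chrom, r_start, r_end = region
    overlaps = aln.chrom == chrom and aln.start < r_end and aln.end > r_start
    if not overlaps:
        return "background"
    # A read on the amplicon is only diagnostically useful if it spans the locus.
    return "target" if aln.start <= locus < aln.end else "fragment"


def loss_fractions(alignments, region, locus):
    """Assumed definitions: mapping loss over all reads, fragmentation loss
    over reads that overlap the target region."""
    counts = {"unmapped": 0, "background": 0, "fragment": 0, "target": 0}
    for aln in alignments:
        counts[classify(aln, region, locus)] += 1
    total = sum(counts.values())
    on_region = counts["fragment"] + counts["target"]
    return {
        "mapping_loss": counts["unmapped"] / total if total else 0.0,
        "fragmentation_loss": counts["fragment"] / on_region if on_region else 0.0,
    }
```

Placing the locus near the center of a short amplicon reduces the number of reads that land in the `fragment` category, which is the design guidance given above.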

Supplementary Note 3: PCR Parameter Optimization and Efficiency Evaluation
Faster PCR cycle parameters can lead to significant reductions in end-to-end amplification time.
Using the best-performing primer set evaluated (260bp), we performed cycle parameter optimization, reducing cycle times 13 while balancing reaction efficiency. Due to the relatively small amplicon size allowed by the ONT rapid library preparation chemistry, PCR cycle times could be reduced significantly versus prior work 10 but ended up relatively close to the suggested lower-bound times (Figure S2a). Estimated end-to-end PCR times were modeled based on a 3 °C/s ramp rate (Figure S2b). The on-paper PCR time ignores the ramp rate penalty and any handling penalties. Optimized PCR time is dominated by the ramp rate of standard lab thermocyclers and could be further reduced by using more rapid thermocyclers.

Figure S2. 3-step PCR assay optimized parameters, comparison to baseline protocols, and performance over various cycle counts. (a) Cycle times for a baseline amplification protocol 10, the standard NEB protocol, and our optimized parameters. (b) On-paper times differ significantly from modeled and measured times during experimentation. As cycle parameters are reduced, thermocycler ramp time begins to dominate total amplification time, leading to a >50% efficiency loss (~10min penalty) and a reduction in the marginal benefit of cycle time reduction. (c) PCR performed well, producing a high proportion of on-target amplicons after 26 cycles.
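The difference between on-paper and ramp-limited PCR time can be illustrated with a short calculation. This is a minimal sketch, not the model behind Figure S2b: the function name `pcr_times` is hypothetical, the ramp is assumed to be linear at a constant rate, and heating from ambient temperature and handling time are ignored. The example call uses the cycling parameters quoted in Supplementary Note 5 (30s @ 98°C; 35 cycles of 10s @ 98°C, 10s @ 62°C, 20s @ 72°C; 2 minutes @ 72°C) rather than the optimized parameters.

```python
def pcr_times(steps, cycles, ramp_c_per_s=3.0,
              initial=(98.0, 30.0), final=(72.0, 120.0)):
    """Return (paper_s, modeled_s) for a PCR program with at least one cycle.

    steps   : list of (temperature_C, hold_s) for one cycle
    initial : (temperature_C, hold_s) of the initial denaturation
    final   : (temperature_C, hold_s) of the final extension
    Assumes a constant linear ramp; ambient-to-initial heating is ignored.
    """
    temps = [t for t, _ in steps]
    paper = initial[1] + cycles * sum(h for _, h in steps) + final[1]

    # Temperature distance travelled inside one cycle (step to step).
    within = sum(abs(b - a) for a, b in zip(temps, temps[1:]))
    # Distance from the last step of a cycle back to the first step of the next.
    wrap = abs(temps[0] - temps[-1])
    # Transitions into the first cycle and out of the last cycle.
    edges = abs(temps[0] - initial[0]) + abs(final[0] - temps[-1])

    travel = cycles * within + max(cycles - 1, 0) * wrap + edges
    return paper, paper + travel / ramp_c_per_s


paper_s, modeled_s = pcr_times([(98.0, 10.0), (62.0, 10.0), (72.0, 20.0)], cycles=35)
print(f"on paper: {paper_s / 60:.1f} min, with ramping: {modeled_s / 60:.1f} min")
```

Under these assumptions, shortening the hold times shrinks only the on-paper term, while the ramp term stays fixed per cycle, which is why ramp time dominates once hold times become short.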
The proper cycle threshold for optimal end-to-end diagnostic performance can be estimated by sequencing product from PCR with varying numbers of cycles. We performed PCR using 1:10 diluted LQE-extracted DNA (see Methods section). We then measured the resulting PCR product concentration via Qubit fluorometer (Invitrogen; HS dsDNA Assay; #Q33230) and sequenced each resulting product using a barcoded, multiplexed rapid library preparation methodology (ONT; SQK-RBK004). Resulting reads were aligned to the human reference and classified according to alignment (see Methods section). As the number of cycles increases, both the total mass and the proportion of PCR product relative to background genomic reads grow (Figure S2c).

Supplementary Note 4: Impact of Rapid Adapter Incubation Time
We tested four different rapid adapter incubation times, ranging from 5 minutes down to 2 minutes. Five minutes is the incubation time recommended by ONT's Rapid Barcoding library preparation kit (SQK-RBK004). Reactions were started at successive 1-minute intervals to ensure all incubations ended at the same time. Once incubations finished, each reaction was immediately and independently mixed with ¼ of the ONT pre-mixed sequencing mix. This was to prevent ONT loading beads from preferentially binding to amplicons first added to one pooled sequencing mix, as observed in earlier experiments. All reactions were then mixed, combined, and sequenced until the lowest target depth reached >5,000 target reads. Fast5 files were basecalled using Guppy v4.2.2 and aligned using minimap2 14 v2.17. Adapter efficiency was estimated by comparing the relative read counts of the various time points. No upward trend in adapter efficiency was noted (Figure S2), indicating that a 2-minute incubation time is sufficient to create a library of sufficient quality to sequence and rapidly diagnose hotspot mutations.

Supplementary Note 5: LQE Dilution Efficiency and MinION Sequencing Feasibility
To test the impact of diluting Lucigen QuickExtract (LQE) extracted DNA on PCR efficiency, two primer sets were designed to target exon 5 of the porcine (Sus scrofa) TP53 gene.
We used representative aliquots (~20mg) of pig brain tissue and followed the standard extraction protocol. Serial dilutions of extract were made and amplified using the following cycling parameters: 30s @ 98°C; then 35 cycles of 10s @ 98°C, 10s @ 62°C, 20s @ 72°C; then a final extension of 2 minutes @ 72°C. Amplification efficiency was assessed by inspecting band brightness following gel electrophoresis (Figure S3). Dilutions of 8x-32x were substantially more efficient than 1x-4x, with a 10x dilution (suggested by the manufacturer) performing the best across both primer sets. We then performed an extraction on an aliquot of pediatric DIPG tumor and normal brain tissue following the standard LQE protocol and performed PCR with 10ul of 10x diluted DNA (1ul LQE product in 9ul nuclease-free water). After confirming amplification, we sequenced product from 26 cycles of PCR to confirm that it was sequenceable, produced valid reads, and identified the known HIST1H3B K27M variant. We successfully sequenced and confirmed the variant using our standard informatics pipelines, basecalling with ONT's Guppy basecaller and aligning reads to the human reference using minimap2 version 2.17. A read pileup visualized using the Integrative Genomics Viewer (IGV) is shown below.

Supplementary Note 6: LQE Incubation Time Efficiency
DNA from prior standard protocol extractions was compared to DNA extracted from a separate aliquot of tumor tissue from the same patient. Extracted DNA was amplified using our primary H3F3A K27M LAMP assay, prepared using the suggested amount of dye. Amplification and fluorescence monitoring were performed at 65°C on an Eppendorf Mastercycler with 15s measurement intervals (cycles). Fluorescence curves are shown in Figure S4. The assay showed almost identical performance between the two extraction methodologies.
Note that due to the inability to pre-heat the real-time fluorescence machine and the difficulty of identifying the ratio of background DNA to the amplifying target, time-to-amplification and the sequencing threshold seem to be underestimated compared to our ultra-rapid protocol determined by time-sweep sequencing.

Supplementary Note 7: Diagnostic Time Optimization Approach
Target amplification increases the target signal over background genomic noise and typically involves conservative "end-point" protocols that maximize amplification (e.g. 35+ PCR cycles) at the expense of time. For a time-optimal diagnostic, amplification should only be performed if the time-to-result benefit outweighs the time cost of further amplification. We used an analytical model to identify this "worthwhile" threshold for amplification time and suggest protocols that attempt to optimize for time-to-result.
As an illustrative example of our method, consider perfectly efficient PCR-based amplification where the target amplicon mass doubles every cycle. Given any amount of input DNA from any number of PCR cycles, we can model total diagnostic time using the formulas shown in Figure S7a. As amplicon mass grows, the proportion of total mass that is background genomic material (gDNA) reduces (Figure S7b) and thus the sequencing rate of target amplicons (versus background genomic DNA) increases. Assuming a 250x target coverage requirement for diagnosis, at a certain point it becomes a waste of time to continue amplification, and instead becomes worthwhile to begin sequencing. Figure S9c shows an illustration of this trade-off using optimistic sequencing and PCR cycling parameters and a more sophisticated model of sequencing time (Table S4). Assuming perfect PCR efficiency, the optimal amount of amplification is 16 cycles before further amplification results in wasted diagnostic time. This represents a savings of ~12-22 minutes over a typical PCR protocol.
To apply this methodology in practice, we first estimate useful target amplicon fraction and ONT flow-cell sequencing rates experimentally for a particular assay and tissue type (e.g. Figure S9a), and then use the sequencing performance model to suggest amplification times that aim to produce a time-optimal result (Figure S7b). Figure S9 develops a model framework to analytically derive the optimal amount of time spent on amplification that results in a minimal total diagnostic time (the optimal amplification threshold).
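The trade-off described in this note can be sketched numerically. The code below is a simplified illustration and not the model of Figure S7a or Table S4: it assumes perfect doubling of target mass per cycle, a constant background mass, a fixed time per cycle, a constant read rate, and that the fraction of reads on target equals the target mass fraction. All function names and the parameter values in the example call (initial target fraction 1e-7, 30 s per cycle, 100 reads per second) are hypothetical.

```python
def target_fraction(cycles, f0):
    """Target mass fraction after `cycles` perfect doublings.

    f0 is the initial target fraction; background mass (1 - f0) stays constant.
    """
    grown = f0 * 2.0 ** cycles
    return grown / (grown + (1.0 - f0))


def total_time_s(cycles, f0, cycle_s, reads_per_s, depth=250):
    """Amplification time plus expected sequencing time to reach `depth` target reads."""
    seq_s = depth / (reads_per_s * target_fraction(cycles, f0))
    return cycles * cycle_s + seq_s


def optimal_cycles(f0, cycle_s, reads_per_s, depth=250, max_cycles=40):
    """Cycle count in 0..max_cycles that minimizes total diagnostic time."""
    return min(
        range(max_cycles + 1),
        key=lambda c: total_time_s(c, f0, cycle_s, reads_per_s, depth),
    )


# Hypothetical parameters, chosen only to show the shape of the trade-off.
best = optimal_cycles(f0=1e-7, cycle_s=30.0, reads_per_s=100.0)
print(best, round(total_time_s(best, 1e-7, 30.0, 100.0), 1))
```

Each extra cycle halves the remaining sequencing time spent on background reads but costs a fixed cycle time, so the minimum sits where one more cycle would save less sequencing time than it costs. The optimum shifts with the assumed read rate, cycle time, and starting fraction, which is why these are estimated experimentally per assay and tissue type.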

Optimal Amplification Threshold Sequencing Model Example Parameters
The equations used to build the model and the corresponding input parameters used to generate

Where p̂ is the observed variant proportion, n is the target support (mutant and wild-type calls), and z is the Z-value for the desired confidence level.
Given that ONT sequencers and corresponding basecallers have relatively high error rates, this confidence interval can be compared to the confidence interval for the expected error rate for that context. We call a variant statistically significant when the lower-bound confidence interval for the observed variant proportion exceeds the upper-bound confidence interval for the expected error rate proportion. The error rate for this locus is known to be ~1% based on prior characterization in Bruzek et al. 15, but we conservatively assume 1.5% for this analysis. Sequencing and basecaller errors are highly dependent on oligonucleotide sequence context and are much higher in low-complexity regions featuring homopolymers. This context-dependent error rate should be considered when calling variants in such regions. An example for a VAF proportion of 10% is shown in Figure S5. Alternatively, a healthy, wildtype sample can be pre-sequenced to characterize the error rate for a particular tissue, assay, target locus, and basecaller version.
Future basecalling accuracy improvements will help reduce this error and lower sequencing depth requirements.

Figure S8. Confidence intervals for simulated observed VAF of 10% (orange) relative to the confidence interval for estimated background 1.5% error rate (blue) plotted for various desired confidence levels (95%, 99%, 99.9%, 99.99%). For each confidence level, we identify a variant call as statistically significant when the lower bound confidence interval of the observed variant fraction becomes larger than the upper bound confidence interval of the background error (red diamonds). Underlying VAF, error rate for that context, and desired confidence level greatly impact the locus support required to call a variant. In practice, it is best to apply these tests in real time. Even for a relatively low VAF of 10%, ~270 reads are expected to be sufficient to call the variant with 99.9% confidence.
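The interval comparison described above can be sketched as follows. The interval formula itself did not survive in this text, so the code assumes the standard normal-approximation (Wald) interval for a binomial proportion, p ± z·sqrt(p(1 − p)/n), with a two-sided Z-value; this is an assumption, not a confirmed description of the authors' calculation. The function names are hypothetical. Under this assumption, a 10% VAF against a 1.5% error rate at 99.9% confidence requires a support close to the ~270 reads quoted above.

```python
import math
from statistics import NormalDist


def wald_interval(p, n, confidence):
    """Normal-approximation interval for a proportion p observed over n reads.

    Assumed form: p +/- z * sqrt(p * (1 - p) / n), with a two-sided z.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half = z * math.sqrt(p * (1.0 - p) / n)
    return p - half, p + half


def is_significant(vaf, error_rate, n, confidence):
    """True when the VAF lower bound exceeds the error-rate upper bound."""
    vaf_low, _ = wald_interval(vaf, n, confidence)
    _, err_high = wald_interval(error_rate, n, confidence)
    return vaf_low > err_high


def min_support(vaf, error_rate, confidence, max_n=100_000):
    """Smallest locus support n at which the call becomes significant, or None."""
    for n in range(1, max_n + 1):
        if is_significant(vaf, error_rate, n, confidence):
            return n
    return None


for level in (0.95, 0.99, 0.999, 0.9999):
    print(level, min_support(0.10, 0.015, level))
```

Because the test is cheap, it can be re-evaluated as reads arrive, so sequencing can stop as soon as the observed support is sufficient for the chosen confidence level.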

Supplementary Note 10: LAMPrey Algorithm Description
LAMPrey is an algorithm for processing a set of FASTQ reads, attempting to identify properly formed LAMP concatemers (lamplicons) and call variants based on a pileup of lamplicon sub-reads. As input, the user supplies a FASTQ file of sequenced LAMP product and a corresponding primer sequence file that details the 6-8 design sequences identified in LAMP primer design (B3, B2, B1, F1, F2, F3, BLP, FLP), as well as other sequences of interest such as ONT barcodes or adapter sequences. The user also describes a "target" sequence that spans the location where the mutation of interest lies and specifies where this target sequence lies in the LAMP design sequence schema. A target sequence can exist anywhere between the F2 and B2 primers as long as it is not covered by another primer.
Sequence Alignment: LAMPrey first parses the FASTQ read file and optionally extracts the reads according to the start_time timestamp emitted by the ONT Guppy basecaller. This is assumed to be the approximate order and time at which the reads are generated. LAMPrey then considers each read and aligns each sequence in the primer file to the read using a standard Smith-Waterman alignment. Alignments are recorded if their identity is above a certain threshold (default of 75%).
Once all possible primers are marked, overlapping aligned sequences are removed, prioritizing alignments with higher identity. After sequence alignment, reads can be immediately classified into one of five different categories.
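The filtering and overlap-removal step can be sketched as a greedy selection. This is an illustrative reading of the description above, not LAMPrey's actual code: the `Hit` record and `select_hits` function are hypothetical, the local alignments are assumed to have been computed already, and ties in identity are resolved by input order, which the text does not specify.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One primer/design-sequence alignment on a read (hypothetical record)."""
    name: str
    start: int       # 0-based, inclusive position on the read
    end: int         # exclusive
    identity: float  # fraction of matching positions, 0..1


def select_hits(hits, min_identity=0.75):
    """Drop low-identity hits, then resolve overlaps in favor of higher identity.

    Greedy: visit hits from highest to lowest identity and keep a hit only if
    it does not overlap any hit that has already been kept.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h.identity, reverse=True):
        if hit.identity < min_identity:
            continue
        if all(hit.end <= k.start or hit.start >= k.end for k in kept):
            kept.append(hit)
    # Return hits in read order so that downstream steps can walk along the read.
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit("F1", 0, 20, 0.90),
    Hit("B1c", 10, 30, 0.80),   # overlaps F1 with lower identity, so it is dropped
    Hit("F2", 30, 50, 0.76),
    Hit("BLP", 60, 80, 0.50),   # below the 75% identity threshold
]
print([h.name for h in select_hits(hits)])
```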
Classification: If the read does not contain any aligned target sequences, it is classified as either an amplicon fragment, background genomic DNA, spurious LAMP amplification, an ONT fragment left over from library preparation, or Unknown (further broken down into "short" unknown sequences (<60bp) and other Unknown). Amplicon fragments are those that align to the gene or target region of interest within some expected distance (e.g. 1000bp) but do not contain the target sequence. These reads are expected given the fragmentation step included in library preparation.
If the read aligns somewhere else in the human genome, it is marked as a background genomic read. These are also expected given that our protocol optimization approach attempts to skip target enrichment to save diagnostic time. If a read has a large proportion of primer sequences covering the read, but does not contain a target sequence, it is marked as suspected spurious amplification. These are assumed to be hybridized primers or other undesired amplification that does not capture the diagnostically relevant information of the patient, but still involves the basic LAMP machinery and at least some primer sequences. These reads are not expected but are difficult to fully avoid without extensive optimization of the LAMP assay. If a read has a high proportion of ONT related sequences covering a large proportion of the read, it is marked as an ONT fragment. ONT fragments could be unligated adapter sequences, or read fragments that contain too little target or background DNA to successfully align to the reference. If a read cannot be classified as any of the other categories, it is marked as unknown, with reads shorter than 60bp marked as "short".
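The classification rules above can be summarized as a small decision function. This is a sketch under stated assumptions rather than LAMPrey's implementation: the order in which the rules are checked, the 0.5 coverage cut-off standing in for "a large proportion", and all parameter and label names are hypothetical. Only the 60bp "short" threshold is taken from the text.

```python
def classify_read(length, has_target, maps_near_target, maps_elsewhere,
                  primer_coverage, ont_coverage,
                  high_coverage=0.5, short_len=60):
    """Assign one read to a category.

    length           : read length in bp
    has_target       : an aligned target sequence was found on the read
    maps_near_target : read aligns within the expected distance of the target region
    maps_elsewhere   : read aligns somewhere else in the human genome
    primer_coverage  : fraction of the read covered by LAMP primer sequences
    ont_coverage     : fraction of the read covered by ONT barcode/adapter sequences
    high_coverage    : assumed cut-off for "a large proportion" (hypothetical value)
    """
    if has_target:
        return "lamplicon"          # passed on to chaining
    if maps_near_target:
        return "amplicon_fragment"
    if maps_elsewhere:
        return "background"
    if primer_coverage >= high_coverage:
        return "spurious"
    if ont_coverage >= high_coverage:
        return "ont_fragment"
    return "short_unknown" if length < short_len else "unknown"


print(classify_read(45, False, False, False, 0.1, 0.2))
```

Tracking the share of reads in each category over a run gives a quick view of assay quality, for example a rising spurious fraction pointing to a LAMP assay that needs further optimization.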
Chaining: Once a lamplicon with at least one target sequence is identified, it is further processed to divide the possible concatemer into its sub-reads. This is accomplished by considering each target sequence and looking for expected primer sequences toward the 5' and 3' ends based on the LAMP assay. This is analogous to the chaining of read hits in many common read mapping algorithms. For example, if a forward target sequence exists between the F1 and B1 regions, we would look to its 5' end to find an aligned F1 sequence, and to its 3' end to find the complement of the B1 sequence. In this way, target sequences are "extended" in the 5' and 3' directions until an aligned sequence that does not match the expected ordering is found (e.g. an ONT barcode, adapter sequence, or FIP/BIP loop transition), or the end of the read is reached. This extension defines a sub-read. Each sub-read is extracted and considered separately.
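The extension procedure can be sketched as a walk outward from each target hit. The code below is a simplified illustration, not LAMPrey's implementation: it handles only one strand orientation, represents each aligned sequence as a `(name, start, end)` tuple in read order, and uses hypothetical labels (`T` for the target, `B1c` and `B2c` for complemented primer regions) and a hypothetical `layout` list describing the expected order of design sequences.

```python
def extract_subreads(hits, layout, target="T"):
    """Split a read into sub-reads by extending from each target hit.

    hits   : list of (name, start, end) tuples sorted by start position
    layout : expected left-to-right order of design sequences around the target
    Returns one (start, end) span per target hit.
    """
    t_pos = layout.index(target)
    subreads = []
    for i, (name, _, _) in enumerate(hits):
        if name != target:
            continue
        # Extend toward the 5' end while neighbors match the expected order.
        left, m = i, t_pos - 1
        while left > 0 and m >= 0 and hits[left - 1][0] == layout[m]:
            left, m = left - 1, m - 1
        # Extend toward the 3' end in the same way.
        right, m = i, t_pos + 1
        while right < len(hits) - 1 and m < len(layout) and hits[right + 1][0] == layout[m]:
            right, m = right + 1, m + 1
        subreads.append((hits[left][1], hits[right][2]))
    return subreads


layout = ["F2", "F1", "T", "B1c", "B2c"]
hits = [
    ("BC", 0, 24),      # ONT barcode: stops the 5' extension
    ("F1", 30, 50),
    ("T", 60, 80),
    ("B1c", 90, 110),
    ("F1", 120, 140),   # unexpected after B1c: stops the 3' extension
    ("T", 150, 170),
]
print(extract_subreads(hits, layout))
```

In this example the read yields two sub-reads, each carrying one copy of the target, so a single concatemer can contribute more than one observation to the pileup.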

Supplementary Note 11: Sources of diagnostic information missed by standard informatics pipeline
This investigation revealed four major failure modes of our original informatics pipeline that explain LAMPrey's improved performance:
1. Off-target alignment and filtering of non-primary alignments (Figure 6e i): sequenced LAMP concatemers are sometimes mapped to off-target locations. Correct alignments can be included as secondary or supplementary alignments by the mapper but are filtered out by the standard diagnostic pipeline.
2. Imbalanced, fragmented concatemers (Figure 6e ii): the fragmentation-based library preparation approach can leave concatemers imbalanced, where two different-sized sub-reads exist: one shorter with the target information, and one longer without. Read mapping algorithms will typically score the longer sub-read as the primary alignment and soft clip the shorter, diagnostically relevant section.

LAMPrey did miss a small percentage of targets identified by our standard pipeline (Figure 6e). Upon further inspection, these reads were missed because LAMPrey failed to identify target seeds in the initial seeding stage due to many basecalling errors. Tuning long-read aligner parameters and investigating one of many other alignment algorithms are possible approaches to improve recovery of diagnostically relevant information. However, the complexity of LAMP concatemers, coupled with a fragmentation-based library preparation approach and the above results, clearly motivates a LAMP-specific analysis and further development of tools such as LAMPrey.

Supplementary Note 12: End-to-End LAMP-based Ultra-rapid Protocol
The final end-to-end protocol is available in updateable electronic format at protocols.io (https://www.protocols.io/view/ultra-rapid-sequencing-lamp-btvmnn46). A visual representation approximating the time-frame of the protocol as performed in this work is shown in Figure S8 along with a step-by-step protocol. This final protocol combines DNA extraction, amplification, and library preparation into one thermocycler program to save time and also simplify equipment usage.